Depth and Autonomy



A Framework for Evaluating LLM Applications in Social Science Research

—APSA 2025—

Ali Sanaei & Ali Rajabzadeh

Who wins, mano a máquina?

“The cheese that the mouse that the cat that the dog that the boy that the teacher that the principal that the inspector noted reported warned scolded chased caught ate was moldy.”


Fill in the blanks: ___ had scolded ___.

Outline

  • We propose a framework with two dimensions: depth and autonomy.
  • We develop a questionnaire for evaluating social science research and apply it to published social science studies that use LLMs as tools. Three dimensions emerge: depth, autonomy, and transparency.
  • We present experimental results suggesting that autonomy is manageable.

The Rise of LLMs in Social Science


Current State:

  • Research is expected to be “validated” in some editor-dependent way.
  • “Interesting” work is slipping through the filters.
  • We seem to even lack the language to talk about how these models are used.

Issues:

  • Validity
  • Reliability
  • Replicability

Our Proposal: A Guiding Framework

We aim for higher research quality, transparency, and preserved human control.

Our Proposal: A two-dimensional framework to help researchers classify, recommend, and evaluate the use of LLMs in their work.


The goal is to reap the benefits of LLMs while preserving transparency and reliability.

Characterizing LLM Usage: Many Dimensions

To build a useful framework, we must first understand the many ways LLM usage can vary. Let’s explore several key dimensions.

Dimension: Scope of Analysis

Refers to the unit of analysis on the input side.

  • Word/Token: Part-of-speech tagging, named entity recognition.
  • Sentence: Sentiment analysis.
  • Paragraph/Chunk: Summarizing a specific section or a social media post.
  • Document: Classifying an entire article.
  • Corpus: Synthesizing themes across multiple documents.


Example: Moving from analyzing sentiment in a single tweet (Sentence) to identifying overarching themes in thousands of interview transcripts (Corpus).

Dimension: Reasoning Load

Indexes whether a task requires simple retrieval or complex, multi-step inference.

  • Simple Recall: Extracting a date or name explicitly mentioned in a text.
  • Multi-step Reasoning: Applying a complex coding rubric that requires checking multiple conditions before assigning a label.


Example:

  • Low Load: “What state contains Albuquerque?”
  • High Load: “Name all states that start with the same letter as the state containing Albuquerque, but do not contain Albuquerque.”

Dimension: Task Novelty

How familiar is the task to the model?

  • In-training: Resembles tasks seen during training (e.g., summarizing news).
  • Novel: A genuinely new problem or a unique combination of concepts.

Dimension: Analytical Logic

Describes whether the analytical categories are fixed beforehand or emerge from the data.

  • Deductive (Fully Predefined):
    • Applying a fixed, pre-existing codebook to a set of interviews.
    • No new codes are allowed.
  • Inductive (Fully Emergent):
    • Performing open coding on focus group transcripts to generate themes from scratch.
    • The categories are an output of the analysis, not an input.
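The deductive constraint (“no new codes are allowed”) can be enforced mechanically rather than by trust. A minimal sketch, assuming a hypothetical fixed codebook and a list of model-assigned labels:

```python
# Hypothetical fixed codebook; in a deductive design it is an input, never an output.
CODEBOOK = {"rule of law", "accountability", "consent in lawmaking"}

def enforce_codebook(assigned_codes: list[str]) -> list[str]:
    """Drop any label the model produced that lies outside the predefined codebook."""
    return [code for code in assigned_codes if code in CODEBOOK]
```

In an inductive design this filter would be removed: the emergent labels themselves are the analytical output.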

Dimension: Iteration

Captures whether the research pipeline is a single-pass or multi-pass process.

  • Single-Pass: The model executes the entire analytical task in one step.
    • Example: A single prompt to code an entire interview.
  • Multi-Pass (Iterative): The task is decomposed into sequential or parallel steps.
    • This allows for human review, refinement, and greater control.
    • Example: A multi-stage pipeline that first extracts quotes, then clusters them, and finally synthesizes themes.
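The multi-stage example above can be sketched as a pipeline in which every intermediate artifact is retained for human review. `call_llm` is a stub standing in for any real chat-completion client, so only the pipeline structure is shown:

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real chat-completion client."""
    return f"[model output for: {prompt[:40]}...]"

def extract_quotes(transcript: str) -> str:
    return call_llm("Extract verbatim quotes relevant to the research question:\n" + transcript)

def cluster_codes(quotes: str) -> str:
    return call_llm("Group these quotes into preliminary codes:\n" + quotes)

def synthesize_themes(codes: str) -> str:
    return call_llm("Synthesize overarching themes from these codes:\n" + codes)

def multi_pass(transcript: str) -> dict:
    # Every intermediate artifact is kept so a human can review each stage
    # before the next one runs.
    quotes = extract_quotes(transcript)
    codes = cluster_codes(quotes)
    themes = synthesize_themes(codes)
    return {"quotes": quotes, "codes": codes, "themes": themes}
```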

Dimension: Epistemology

Situates the underlying philosophical stance of the research.

  • Positivist:
    • Assumes an objective reality to be measured.
    • Emphasis on quantifiable tasks like content analysis with a predefined codebook, where replicability is key.
  • Interpretivist:
    • Assumes reality is socially constructed and subjective.
    • Emphasis on exploring potential meanings, surfacing ambiguities, and generating initial interpretations for deeper human analysis.

Why These Two Dimensions?

Interpretive Depth

  • maps onto the spectrum of qualitative methodologies, from descriptive content analysis to deep hermeneutics.
  • is an intrinsic feature of the research question itself.

Realized Autonomy

  • is the most consequential for evaluating the reliability and safety of the analysis.
  • is a feature of the execution: the pipeline and workflow choices made by the researcher.

Focusing on Depth and Autonomy

Interpretive Depth

  • The kind of inference the model is asked to perform.
  • Ranges from surface-level extraction to deep hermeneutic analysis.
  • Set by the research question.

Realized Autonomy

  • The extent to which consequential choices are made by the model.
  • Ranges from a simple tool to a delegated trustee.
  • Set by the research pipeline.

The Autonomy-Depth Plane

A visual guide for designing and evaluating LLM applications.

  • Low-autonomy configurations are safer, even for high-depth tasks.
  • As tasks require deeper interpretation, the temptation to grant more autonomy increases.
  • The top-right quadrant represents a High-Risk Zone, where high model autonomy is combined with deep, nuanced interpretation.

The Bounded-Autonomy Principle

Treat LLMs as capable but fallible research assistants, not as oracles.

  • Decompose complex tasks into manageable, auditable steps.
  • Provide clear rubrics, worked examples, and structured outputs.
  • Require citations and direct textual evidence for all claims.
  • Reserve critical interpretive decisions and conflict resolution for the human researcher.
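The “direct textual evidence” requirement can be checked programmatically: accept a coded claim only if its quoted evidence appears verbatim in the source. A minimal sketch; the JSON shape is an assumption for illustration, not a standard:

```python
import json

def validate_coded_output(raw_json: str, source_text: str) -> list[dict]:
    """Keep only claims whose quoted evidence appears verbatim in the source.

    Expects model output shaped like: [{"code": "...", "evidence": "..."}]
    """
    verified = []
    for claim in json.loads(raw_json):
        evidence = claim.get("evidence", "")
        # A claim with missing or fabricated evidence is silently dropped,
        # leaving conflict resolution to the human researcher.
        if evidence and evidence in source_text:
            verified.append(claim)
    return verified
```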

Strategy: Task Decomposition

Breaking down complex tasks is key to maintaining low autonomy while achieving high depth.

Vertical Decomposition

  • Sequential subtasks.
  • Output of one stage is input for the next.
  • Example:
    1. Extract evidence
    2. Cluster codes
    3. Synthesize themes

Horizontal Decomposition

  • Parallel subtasks.
  • Run across text segments or analytical dimensions.
  • Example:
    • Analyze for “rule of law”
    • Analyze for “accountability”
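Horizontal decomposition amounts to one independent model call per analytical dimension, so one dimension's framing cannot bleed into another's. A sketch with a stubbed model call (`call_llm` is a placeholder, not a real API):

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return f"[analysis for: {prompt[:30]}]"

def horizontal_decompose(text: str, dimensions: list[str]) -> dict:
    # Each dimension is analyzed in isolation; the per-dimension outputs
    # remain separately auditable before any synthesis step.
    return {
        dim: call_llm(f"Analyze the following text only for '{dim}':\n{text}")
        for dim in dimensions
    }
```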

Surveying the Field

How are LLMs actually being used in published social science research?

We systematically coded 56 published articles to map the current state of the literature onto our framework.

Survey Findings: A Diverse Landscape

Our analysis of published papers reveals:

  • Wide variation in how researchers use LLMs.
  • Studies are scattered across the Autonomy-Depth plane.
  • No clear correlation yet between depth and autonomy.
  • Transparency and evaluation practices are highly heterogeneous.

This variation highlights the need for a common framework.

Experimental Evidence

We conducted two experiments to test the core principles of our framework.

  1. The Abstention Test: Can LLMs reliably say “I don’t know”?
  2. The Decomposition Test: Does breaking down tasks improve results?

Experiment 1: The Abstention Test

Objective: Assess whether an LLM will fabricate answers for an impossible task.

Design:

  • Task: Find evidence of “bicameralism” in a 7th-century letter (an anachronistic and conceptually mismatched query).
  • Conditions:
    • Constrain output to 1-10 items.
    • Provide an explicit abstention option (“There is no evidence for that!”).
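The four conditions (item range × abstention option) can be generated from one prompt template. A sketch; the wording is illustrative, not the exact prompt used in the experiment:

```python
def build_extraction_prompt(task: str, low: int, high: int,
                            explicit_abstention: bool) -> str:
    """Compose one experimental condition: an item range plus an optional exit path."""
    prompt = f"{task}\nReturn between {low} and {high} items of evidence."
    if explicit_abstention:
        # The explicit exit path that prevents fabrication under constraint.
        prompt += ('\nIf the text contains no such evidence, answer exactly: '
                   '"There is no evidence for that!"')
    return prompt
```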

Exp 1: Results

An explicit “exit path” is critical to prevent fabrication.

Instruction Range | Explicit Abstention | Mean Items Found
1-10 elements     | No                  | 7.36
1-10 elements     | Yes                 | 0.16
0-10 elements     | No                  | 5.26
0-10 elements     | Yes                 | 0.00


Takeaway: Without a way to abstain, the model will hallucinate to satisfy the prompt’s constraints.

Experiment 2: The Power of Decomposition

Objective: Test whether task decomposition reduces realized autonomy and whether it improves output quality.

Design:

  • Task: Extract elements of “constitutionalism” from the same 7th-century letter.
  • Three Methods:
    • Baseline: A single, complex prompt.
    • Two-Stage: 1) Propose a coding schema, 2) Apply it.
    • Multi-Stage: 1) Propose schema, 2) Apply to dimensions in parallel, 3) Synthesize.
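The contrast between the baseline and the two-stage method is essentially one call versus two, with the coding schema exposed as a reviewable artifact in between. A stub-based sketch (not the actual prompts used in the experiment):

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return f"[model output for: {prompt.splitlines()[0]}]"

def baseline(letter: str) -> str:
    # Single complex prompt: the coding schema stays hidden inside the model.
    return call_llm(f"Extract all elements of constitutionalism from:\n{letter}")

def two_stage(letter: str) -> dict:
    # The schema becomes a visible artifact a human can inspect before it is applied.
    schema = call_llm("Propose a coding schema for constitutionalism.")
    coded = call_llm(f"Apply this schema to the letter.\nSchema:\n{schema}\nLetter:\n{letter}")
    return {"schema": schema, "coded": coded}
```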

Exp 2: Results

Decomposition yields more detailed, stable, and auditable results.

Element                           | Two-Stage Score | Multi-Stage Score
Legal limits on rulers’ powers    | 9               | 9
Supremacy of constitutional norms | 8               | 9
Procedural limits                 | 9               | 9
Amendment rules                   | 2               | 0
Consent in lawmaking              | 3               | 2


  • All methods reached a similar high-level conclusion.
  • However, the Baseline was a “black box” with limited detail.
  • Multi-Stage provided the richest, most reliable, and fully auditable analysis.

Practical Takeaway


Break the task,

bind the output,

and climb the ladder of abstraction under human gaze.

Conclusion

  • The Depth-Autonomy framework offers a structured way to design and evaluate LLM applications in social science.
  • Constraining autonomy through task decomposition is the key to achieving reliable results for high-depth interpretive tasks.
  • Explicit abstention options are crucial for preventing model fabrication and ensuring research integrity.
  • Multi-stage pipelines produce more detailed, stable, and auditable outputs than single-pass approaches.